Immunogenetics Assessment

Conor M. Finlay - School of Medicine, Trinity College Dublin. - <

2021-11-11

Please refamilarise yourself with workshop 2 before proceeding. Availible on my github or on blackboard.

Setup

Load packages

library(tidyverse)
library(readxl)
library(cowplot)
library(ggbeeswarm)
library(EnhancedVolcano)
library(ggpubr)

Public datasets

Typically data from flow cytometery experiments don’t get uplifted to public datasets. It is much more common to upload proteomic data transcriptomic data. This partly reflects that journals have requirements that data from these experiments should be uploaded to databases such as NCBI. This has led to a transformation in modern biology - with so much data freely available, it is often unnesscary to even do an experiment! More realistically, experiments are still critical but these datsets are used as a form of hypothesis generation. Allowing scientists to test ideas in silico before spending resourses in the lab.

Here we will download a public dataset from NCBI and reanalyse part of that datset. We will use data generated by a microarray which is a well established, if ageing, method of measuring RNA transcripts in a tissue sample. It uses probes to detect mRNA molecules isolated from cells.

The data generated from microarray (and RNA-seq) is typically put through a differential gene expression analysis pipline which compare two groups of samples (e.g. control group, manipulated group) to produce data which contains a fold change and statistical tests for each gene in the array (often 10 of 1000s of genes).

Downlaod the data

So we will use a dataset from a paper I liked. This is a complicated paper but we want to take a take a small subset of the data and use it. This is for a macrophage in the peritoneal cavity called a large peritoneal macrophage (LPM). This is similar to the LCM in my dataset above.

We want to compare the transcriptome between LPM from WT mice given no stimulus and WT LPM given IL-4 complex a cytokine which activates type 2 immune responses and in macrophages is said to lead to ‘alternative activation’ (M2, M(IL-4) activation)

The paper has a link to the data in the methods section of the paper: GSE125727.

This brings us to a NCBI page for the datasets. For microarray datasets we can easily reanalyse these data using an R script built into the website called “Geo2R”. Analysis of RNA-sew data is a little more involved, but we will ignore that in this course.

Geo2R I have highlighted the relevant samples. To start click geo2R.

On the next page, we define our groups. Do this by clicking ‘define groups’ typing 2 names, “naive” and “IL-4c”. To assign samples to those groups, cntrl or shift click the relevant sameps and then click on those groups.

How to assign groups

Then scroll down to the bottom of the page and click “Analyze”. This will tell the server to run the r script using the argument you provided (group information). You can actually view the R script if you like, or even copy it into R to run it on your own computer. Although you will need to install more packages, and the code might be a bit over your head at this stage. The analysis is a version of differential gene expression which looks at genes which are up or down in two groups.

This will take a couple of minutes to run, but eventually you will get a lot of outputs. What it will produce is a table of genes as rows, with various columns including 2 we are interested in, Fold change and P values.

Super cool outputs!

You can download these pictures if you like but what you really want is the table file. So select ‘download full table.’ This will download the a tsv file. A tsv file is like a csv file expect the data is separated by tabs not commas.

Move the file into the data folder for this project.

load in data

Here we will load in the data (assuming you were able to download it and put it in the right place). Unlike before we have to tell R that the tabs separate the data so the function needs another argument, sep="\t"

Your dataset will have a different name so you will have to adjust the file name and location. Hint: use tab complete inside "".

micro <- read.csv(file = "data/GSE125727.top.table.tsv", sep = "\t")

Lets investigate the daraframe

dim(micro)
[1] 30177     8

So we have a lot or rows and 8 columns. Let’s look at the columns first.

colnames(micro)
[1] "ID"          "adj.P.Val"   "P.Value"     "t"           "B"          
[6] "logFC"       "Gene.symbol" "Gene.title" 

Hmmm, this looks fairly understandable. The cols we will be interested in are "logFC’ or log base 2 of fold change, and adjusted P value. We could select only that data to simplify matters.

However the gene symbols may change, because this data is supplied by the original scicentists and they may have come up with their own name. We next print out the table to see if we can spot gene names, which should be in the format of Gapdh for mouse and GAPDH for huamn.

(head((micro)))
        ID adj.P.Val  P.Value         t        B     logFC Gene.symbol
1 10475517  2.28e-07 7.55e-12 -67.51334 15.19131 -7.661585    AA467197
2 10419568  9.67e-07 6.41e-11 -50.94326 14.26932 -8.391100     Rnase2a
3 10501026  2.96e-06 2.94e-10 -41.67332 13.41073 -5.504030       Chil4
4 10476945  4.65e-06 7.88e-10 -36.58754 12.76415 -5.747173        Cst7
5 10534927  4.65e-06 8.49e-10  36.22528 12.71189  3.885703       Pilra
6 10378793  4.65e-06 1.04e-09 -35.27119 12.56978 -4.028273      Tmigd1
                                                               Gene.title
1                                             expressed sequence AA467197
2 ribonuclease, RNase A family, 2A (liver, eosinophil-derived neurotoxin)
3                                                        chitinase-like 4
4                                              cystatin F (leukocystatin)
5                          paired immunoglobin-like type 2 receptor alpha
6                    transmembrane and immunoglobulin domain containing 1

Sow where is your Gene symbols? What is the column title, does it have no title?

If this is the case you can adapt the following code to rename your variable to Gene.symbol

# remove "#" to to run lines of code
# micro$Gene.symbol <- micro$[the namme of your col containing gene symbols]

# common examples below
#micro$Gene.symbol <- micro$ORF
micro <- micro %>% select(c("P.Value","logFC","Gene.symbol"))

Now our dataset micro contains only three cols. The dataset contains 30177 rows, likely corresponding to genes (in reality they correspond to probes not genes, so the same gene may appear multiple times). We would not write out 30177 genes but we can use the head function to look at the first few.

head(micro$Gene.symbol)
[1] "AA467197" "Rnase2a"  "Chil4"    "Cst7"     "Pilra"    "Tmigd1"  

Cool! They look like genes to me! Chil4 is one of my favourite mouse genes too!

volcano plot

We will produce a basic volcano plot in R using ggplot2. We will need one more R package called ggrepel. Install this using install.packages("ggrepel").

library(ggrepel)

The most basic volcano plot can be made with the following code:

micro %>%
  ggplot(aes(x =`logFC`,
             y=-log(`P.Value`) # do you notice the -log function?
               )) +
  geom_point() + theme_cowplot()

We can of course make this look better! Lets creat a new col that that can be used to highlight certain highly differential genes. The code from here on down might be a bit more complicated.

You can change the p value cutoffs here if you like reducing or increasing them to suit your dataset

If your dataset is very different and the values are not that different, you may want to reduce the LofFC >= 1 or something similar. Just because you don’t have log2FC of +5 or -7 doesn’t mean there isn’t differences, perhaps the amount of mRNA captured was less or the difference between the groups was subtle (ofthen the case for KO cells when the cell type is the same, but one gene is lost.)

# mark up/down regulated genes
micro <- micro %>%
  mutate(gene_type = case_when(logFC >= 1 & P.Value <= (0.05) ~ "up",
                               logFC <= -1 & P.Value <= (0.05) ~ "down",
                               TRUE ~ "ns"))   

So what did that create? Let’s look

head(micro$gene_type)
[1] "down" "down" "down" "down" "up"   "down"

We will also create some variable that we can use for advanced plotting

# Add colour, size and alpha (transparency) to volcano plot --------------------
cols <- c("up" = "#ffad73", "down" = "#26b3ff", "ns" = "grey") 
sizes <- c("up" = 2, "down" = 2, "ns" = 1) 
alphas <- c("up" = 1, "down" = 1, "ns" = 0.4)
micro %>%
  ggplot(aes(x =`logFC`,
             y=-log(`P.Value`),
             fill = gene_type,    
             size = gene_type,
             alpha = gene_type)) +
  ggtitle('My title') +
  xlab('Log2 Fold Change') +
  geom_point(shape = 21, # Specify shape and colour as fixed local parameters    
             colour = "black") + 
  geom_hline(yintercept = -log10(0.001),
             linetype = "dashed") + 
  geom_vline(xintercept = c(-2, 2),
             linetype = "dashed") +
  scale_fill_manual(values = cols) + # Modify point colour
  scale_size_manual(values = sizes) + # Modify point size
  scale_alpha_manual(values = alphas) + # Modify point transparency
 theme_cowplot() +
  theme(legend.position = 'none')

That looks nicer. I won’t describe each line above, but you can read each line and see if you can work out what it does. You can comment out a line in the code using a hash # at the start of a line to stop the code being run, but without deleting it. Try commenting out various lines to see what that does.

Adding text to the volcano plot

The last thing we want to do is add text to the graph. Now we do not want to label all of the genes or our plot will look terrible and take hours to run. We just want to highlight the top few differential gene and highlight them on the graph.

First lets select the most differential genes. First arrange by LogFC ignoring minus sign using the abs() function and the desc() to place them in decending order.

micro <- micro %>% 
  arrange( desc(abs(logFC)))
head(micro, 20)
    P.Value     logFC   Gene.symbol gene_type
1  6.41e-11 -8.391100       Rnase2a      down
2  7.55e-12 -7.661585      AA467197      down
3  8.17e-09 -6.095598          Mgl2      down
4  7.88e-10 -5.747173          Cst7      down
5  5.79e-09 -5.659000         Chil3      down
6  2.94e-10 -5.504030         Chil4      down
7  1.66e-09 -5.436653         Timd2      down
8  4.82e-08  5.359922          Fpr1        up
9  8.15e-08  5.320450        Cxcl13        up
10 2.17e-07 -5.071113          Ucp1      down
11 1.98e-07 -5.016402       Klk1b11      down
12 2.40e-08 -4.988765 I830127L07Rik      down
13 2.50e-08 -4.988538          Mall      down
14 1.03e-07 -4.849938        Gm5150      down
15 3.07e-09 -4.813952          Spp1      down
16 6.80e-09 -4.772435          Gatm      down
17 7.35e-08  4.520842         Cxcl2        up
18 7.51e-07  4.516250          Ccl3        up
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

Now get the 20 top genes here.

genes_to_graph <- head(micro$Gene.symbol, 16)

What if we wanted to add some genes we like to this? Well we can just append the vector with some more gene symbols

Next we want to create a new object which is a subset of micro that contains only the genes we are interested in. This is useful for telling ggplot to label only specific points. There are actually many ways of doing this and this is atually not a very intuitive method, but it works.

identify <- micro %>% filter(Gene.symbol %in% genes_to_graph  )

Finally we graph this.

micro %>%
  ggplot(aes(x =-`logFC`,
             y=-log(`P.Value`),
             fill = gene_type,    
             size = gene_type,
             alpha = gene_type)) +
  ggtitle('My title') + # cjamge plot title
  xlab('Log2 Fold Change') + # x axis label
  geom_point(shape = 21, # Specify shape and colour as fixed local parameters    
             colour = "black") + 
  geom_hline(yintercept = -log10(0.001), # horizontal line
             linetype = "dashed") + 
  geom_vline(xintercept = c(-2, 2), #vertical line
             linetype = "dashed") +
  scale_fill_manual(values = cols) + # Modify point colour
  scale_size_manual(values = sizes) + # Modify point size
  scale_alpha_manual(values = alphas) + # Modify point transparency
    geom_label_repel(data = identify,size=5, # Add labels last to appear as the top layer  
                   aes(label = Gene.symbol),
                   force = 3,
                   nudge_y = 1,max.overlaps = 200) +
 theme_cowplot(font_size = 18) +
  theme(legend.position = 'none')

ggsave(filename = "volcano_.png",width = 8,height = 5,dpi = 300)

Try to reverse engineer some of that code, removing some lines and changing values.

Advanced analysis

The following packages maybe required for the more advanced analysis in using Jamie’s code or if you want to run geo2r’s R script yourself

You can find the Rscript here

Just copy this into R and run it and it should work

load additonal packages

These can be installed with the functions

if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("GEOquery") BiocManager::install("limma") install.packages(“umap”) install.packages("RcolorBrewer") install.packages("ggrepel") install.packages("ggfortify")

library(GEOquery)
library(limma)
library(RColorBrewer)
library(pheatmap)
library(umap)
library(ggrepel) 
library(ggfortify)
#if you get errors try running this code
#Sys.setenv("VROOM_CONNECTION_SIZE" = 131072 * 4)
#readr::local_edition(1)

Of course you are welcome to produce your own analysis! Just google mciroarray analysis in R.

Assessment

Each group will be given a paper.

In each paper there is a microarray dataset. The GSE number for the data set is in the paper. Search for this dataset on the NCBI website.

You are asked to analyse the dataset using the online GEO2R tool. You must define groups for this to work. It is critical that you define only 2 groups. Where more than two groups exist use your judgment from reading the paper as to which group comparison is relevant. It might be more interesting to choose groups which are very different than each other, e.g. stim vs no stim. Please download your table as a csv file by clicking on ‘Download full table’. Save this file as ‘x.csv’

Read this into R and produce some analysis. E.g. volcano plot.